Predicting Earthquake Building Damage

Haley Egan / vkb6bn
DS 6040 / Summer 2022

Introduction

In 2015, a 7.8 magnitude earthquake struck near Gorkha, Nepal, devastating the surrounding communities. About 9,000 people were killed, millions lost their homes, and roughly $10 billion in damage was done. The Nepalese government is still working to rebuild the affected area. During the years of recovery, the National Planning Commission, Kathmandu Living Labs, and the Central Bureau of Statistics collected one of the largest post-disaster datasets ever assembled. This data contains valuable information on earthquake structural damage and socio-economic impacts.

This project is inspired by the Richter’s Predictor: Modeling Earthquake Damage data challenge hosted by DrivenData.org. It harnesses the dataset on the 2015 Nepal earthquake to model and predict the level of damage buildings suffered in the earthquake. The insights gained from these predictions can be applied to future earthquakes, and ideally help prevent this level of devastation and loss of life from occurring again.


Data

The data comes from the 2015 Nepal Earthquake Open Data Portal. The data contains 38 features, including structural information like number of floors, age of building, and type of foundation, as well as ownership status, building use, and number of families who lived there. Eight of the 38 features are categorical, resulting in mixed data types, which is addressed in the Data Cleaning & Preparation section.

Each building has a unique id, and has a corresponding damage grade (1 = low damage, 2 = medium damage, 3 = complete destruction). This is a large dataset, with 208,480 individual buildings (rows) documented. The features are used to predict the damage level of each building on the test data set.

For training and testing purposes, the full data set was split into a training set containing 80% of the original data and a testing set containing the remaining 20%. All data cleaning, transformations, and feature reductions were applied to both sets. Models were fit on the training data and then evaluated on the testing data, as is standard for this type of statistical analysis and modeling.
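The 80/20 split can be sketched with sklearn as follows. The toy DataFrame and its column names here are illustrative stand-ins, not the project's actual code or data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the full building data (names illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(0, 100, size=100),
    "count_floors_pre_eq": rng.integers(1, 5, size=100),
    "damage_grade": rng.choice([1, 2, 3], size=100, p=[0.10, 0.57, 0.33]),
})

X = df.drop(columns="damage_grade")
y = df["damage_grade"]

# 80/20 split; stratify keeps the damage-grade proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```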

Model Selection

Bayesian Data Analysis techniques are used to model and predict this data set. Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression are the models used and compared in this report to predict building damage levels. The data is processed and modeled in Python, using packages such as pymc, pandas, numpy, and sklearn.

Markov chain Monte Carlo (MCMC) is used to find the posterior distribution of the model parameters (the coefficients relating the predictors to the response). Trace plots of the MCMC samples are used to determine whether the sampler converges and to check for evidence of divergent chains. Density plots are also used to visualize the posterior distributions of the parameters.

Priors

For LDA and QDA, two sets of priors are used for each model. The first set is drawn from initial observations of building damage in photos, videos, and articles about the 2015 earthquake. These sources suggested that very few buildings experienced little or no damage (damage grade 1), most buildings experienced some level of damage (damage grade 2), and many buildings experienced complete destruction (damage grade 3). Based on this prior knowledge, the damage grades were given the following prior probabilities: damage grade 1 = 5%, damage grade 2 = 70%, damage grade 3 = 25%.

The second set of priors is used as a comparison to the first, and is calculated from the observed proportion of buildings at each damage level. While this method is not based on ‘prior knowledge’, it helps show how well the initial priors performed. For damage grade 1, the observed proportion (buildings per grade / total buildings) in the training dataset is 0.0965, about 9.7%; for damage grade 2 it is 0.5695, about 57%; and for damage grade 3 it is 0.3339, about 33.4%.
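The observed-proportion priors can be computed directly from the training labels. A minimal sketch using an illustrative series of grades rather than the real data:

```python
import pandas as pd

# Illustrative damage grades; in the project these come from the
# training split's 'damage_grade' column.
damage = pd.Series([2, 3, 2, 1, 2, 3, 2, 3, 2, 2])

# Observed proportion per grade = buildings per grade / total buildings
priors_observed = damage.value_counts(normalize=True).sort_index()
print(priors_observed)  # grade 1: 0.1, grade 2: 0.6, grade 3: 0.3
```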

Feature Selection

The original dataset contains 38 features. After using one-hot encoding to transform the categorical variables into their own columns of binary 0/1 integers, there are 70 features, or columns. For Bayesian data analysis, this is too many features for the models to run efficiently. Based on observations gained from the exploratory data analysis, the full data set was reduced to 26 features. The removed features did not appear to add value to the predictions of building damage grade, and some were too highly correlated, which could skew the results. The reduced data set was used for the LDA, QDA, and Logistic Regression models. Bayesian Model Averaging was used to further reduce the features, but this did not improve the overall performance of the model.


Data Cleaning & Preparation

After initial data cleaning and preparation, no missing data were found.

In the original dataset, 30 of the predictors are integers (type: int64) and 8 are categorical (type: object). One-hot encoding was used to create new binary columns for the categorical predictors. For example, the predictor 'foundation_type' has values 'h', 'i', 'r', 'u', and 'w', indicating which type of foundation each building (row) has. After one-hot encoding, each value becomes its own column with a binary value of 0 (False) or 1 (True). This technique was applied to each categorical variable, replacing the 8 categorical columns with 40 binary columns and bringing the dataset to 70 features.
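The one-hot step can be sketched with pandas' `get_dummies`. The toy frame below is illustrative; the real data has 8 categorical predictors handled the same way:

```python
import pandas as pd

# Toy frame with one categorical predictor and the response
df = pd.DataFrame({
    "foundation_type": ["h", "i", "r", "u", "w", "r"],
    "damage_grade": [1, 2, 3, 2, 1, 3],
})

# One-hot encode: each category value becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["foundation_type"], dtype=int)

# Cast the response to string so the classifiers treat it as a class label
encoded["damage_grade"] = encoded["damage_grade"].astype(str)
print(list(encoded.columns))
```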

Lastly, the response variable 'damage_grade' was transformed from an integer to a string so that LDA, QDA, and Logistic Regression treat it as a class label, since these models require a categorical response.


Exploratory Data Analysis (EDA)

EDA was performed to gain initial insights into the data and their interactions. A basic bar plot of the damage grades makes it apparent that most buildings have a damage grade of 2, followed by grade 3. Over 140,000 buildings had medium damage, about 90,000 had complete destruction, and only about 25,000 had low damage.

Comparisons between damage grade and other predictors reveal which predictors may be most important. For example, the age of a building has some influence on the level of damage. Of all the predictors, the type of superstructure (the materials of the building above ground level) appears to play the largest role in how badly the earthquake damaged the building. Most buildings made of adobe mud experienced damage grade 3, complete destruction, while most buildings made of bamboo experienced damage grade 1, low damage. 'foundation_type_r' and 'foundation_type_h' also appear significant when assessing damage: buildings with the highest damage grade most frequently have one of these two foundation types.

A correlation check across all the predictors does not reveal any strongly correlated pairs. This suggests that most, if not all, of the predictors can be used for modeling and prediction without skewing the results.


Models

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) estimates the probability that a new set of inputs belongs to each damage grade class (1, 2, or 3). LDA models the likelihood of each damage grade class as a Gaussian distribution, then uses the posterior probabilities to assign a class to each test point (Srivastava et al., 2007).
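As a sketch of this setup, sklearn's `LinearDiscriminantAnalysis` accepts the class priors directly through its `priors` argument. The features and labels below are random toy data, not the building dataset:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy features/labels standing in for the building data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.choice(["1", "2", "3"], size=300, p=[0.05, 0.70, 0.25])

# First prior set from the report: P(grade 1)=0.05, P(grade 2)=0.70, P(grade 3)=0.25
lda = LinearDiscriminantAnalysis(priors=[0.05, 0.70, 0.25])
lda.fit(X, y)

# Misclassification rate = 1 - accuracy
err = 1 - lda.score(X, y)
print(err)
```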

Priors based on Observational Knowledge

[Figure: results on the test data]


Observed Proportion of Damage Grades

[Figure: results on the test data]

Analysis of LDA Model

When using priors gained from observations of earthquake photos and videos, the LDA model performed better than when using the observed proportion priors (actual number of buildings per grade divided by total number of buildings). For the first set of priors, the misclassification error rate was 56.1% on the training data and 88.9% on the testing data, i.e., an accuracy of 43.9% on the training data and 11.1% on the testing data.

For the second set of priors, from observed proportions, the misclassification error rate was 57.9% on the training data and 89.3% on the testing data, i.e., an accuracy of 42.1% on the training data and 10.7% on the testing data. While both sets of priors gave similar results, the first priors (from prior knowledge) performed slightly better on both the training and testing data. That said, the overall misclassification rate is very high and the accuracy very low: nearly 90% of the test set is wrongly classified. This suggests that LDA is likely not the best model for predicting the damage grade of buildings from the earthquake.


Quadratic Discriminant Analysis (QDA)

Similarly to LDA, Quadratic Discriminant Analysis (QDA) estimates the probability that a new set of inputs belongs to each damage grade class (1, 2, or 3). QDA also models the likelihood of each damage grade class as a Gaussian distribution, and then uses the posterior probabilities to assign a class to each test point (Srivastava et al., 2007).
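The QDA setup mirrors the LDA sketch, again using sklearn's `priors` argument and random toy data in place of the building dataset:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Toy features/labels standing in for the building data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.choice(["1", "2", "3"], size=300, p=[0.05, 0.70, 0.25])

# Unlike LDA, QDA fits a separate covariance matrix per class,
# giving quadratic (curved) decision boundaries.
qda = QuadraticDiscriminantAnalysis(priors=[0.05, 0.70, 0.25])
qda.fit(X, y)

err = 1 - qda.score(X, y)  # misclassification rate
print(err)
```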

Priors based on Observational Knowledge

[Figure: results on the test data]


Observed Proportion of Damage Grades

[Figure: results on the test data]

Analysis of QDA Model

When using the observational priors, gained from photos and videos of the earthquake, the QDA model had a misclassification rate of 74.9% on the training data and 93.5% on the testing data, i.e., an accuracy of 25.1% on the training data and 6.5% on the testing data.

For the observed proportion priors (actual number of buildings per grade divided by total number of buildings), the misclassification rate was 75.3% on the training data and 93.6% on the testing data, i.e., an accuracy of 24.7% on the training data and 6.4% on the testing data. The first set of priors, based on prior knowledge, performed slightly better than the observed proportion priors. However, the overall misclassification rate for both sets of priors is very high, with nearly all test buildings misclassified. This suggests that the QDA model does not perform well on this data. QDA also performs worse than LDA, indicating that it should not be used for predicting the damage grade of buildings from this earthquake.


Logistic Regression

The outcome of Logistic Regression depends on the intercept, β0, and the slope coefficients, β1. The intercept shifts the logistic curve left or right, and the slope controls its steepness. The probabilities produced by the model help quantify the uncertainty of the analysis. A Bernoulli likelihood is used, since the response is binary. The same prior is placed on every coefficient, since it is unknown what impact each feature has on the response. Bayes' theorem is then used to find the posterior probability distribution of the model parameters, i.e., the intercept and the coefficient of each feature (predictor) of the response ‘damage_grade’.

In order to use Logistic Regression, a binary response is needed. The earthquake damage data has three classes: Grade 1, Grade 2, and Grade 3. For this model, the three classes were collapsed into two: if the damage grade is 1, it was given the value 0; if the damage grade is 2 or 3, it was given the value 1. With the classes converted to a binary response, a logistic regression model can be run. For the training data set, there are 20,033 buildings mapped to 0 (damage grade 1) and 188,447 buildings mapped to 1 (damage grades 2 and 3). For the test data set, there are 5,091 buildings mapped to 0 and 47,030 buildings mapped to 1.
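The binarization and a frequentist stand-in for the regression fit can be sketched as follows; the grades and features are simulated toy data, and sklearn's `LogisticRegression` is used here in place of the Bayesian model for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated damage grades and features standing in for the real data
rng = np.random.default_rng(2)
grades = pd.Series(rng.choice([1, 2, 3], size=500, p=[0.10, 0.57, 0.33]))
X = rng.normal(size=(500, 5))

# Collapse the three grades to binary: grade 1 -> 0, grades 2 and 3 -> 1
y = (grades >= 2).astype(int)

clf = LogisticRegression().fit(X, y)
acc = clf.score(X, y)  # accuracy on the (toy) training data
print(acc)
```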

Bayesian Model Averaging

Bayesian Model Averaging (BMA) learns the parameters of all candidate models and then combines the estimates according to the posterior probabilities of the models. In BMA, a parameter estimate is obtained by averaging the estimates from the different models, weighted by each model's posterior probability (Hinne et al., 2020).
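The averaging step itself is simple to illustrate. The model probabilities and per-model coefficients below are made-up numbers, not results from the report:

```python
import numpy as np

# Hypothetical posterior model probabilities and per-model coefficient
# estimates for one feature (illustrative numbers only)
model_probs = np.array([0.6, 0.3, 0.1])        # P(M_k | data), sums to 1
coef_per_model = np.array([0.25, 0.20, 0.0])   # E[beta | M_k, data]

# BMA estimate: average the per-model estimates, weighted by model probability
bma_coef = float(np.sum(model_probs * coef_per_model))
print(bma_coef)  # 0.21
```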

[Figure: results on the test data]


[Figure: results on the test data]

Analysis of BMA & Logistic Regression Models

When conducting Bayesian Model Averaging, the reduced data set of 26 features was used for the first BMA model. The output table showed that several of the features have a probability and average coefficient close to 0. This suggests that these features do not fit the model well. Thus, 8 features with probabilities and coefficients close to 0 were removed from the data set. The BMA model was run again with the further reduced data set, which contained features that fit the model well.

Despite the reduction to the best-fitting features, the accuracy of the model did not improve; in fact, the results were identical. With the training data, the model with 26 features had an accuracy of 0.9091, or 90.9%. With the testing data, the model with 26 features had an accuracy of 0.9081, or 90.8%. With the reduced model using only the best-fitting features, the BMA accuracy was 0.9091, or 90.9%, on the training data, and 0.9081, or 90.8%, on the testing data. This suggests that the further reduced model with 18 features performs as well as the 26-feature model.

The accuracy of the Logistic Regression model with 26 features was 0.9091 (90.9%) on the training data set and 0.9078 (90.8%) on the testing data set. After the data set was reduced to 18 features using BMA, the Logistic Regression accuracy was 0.9091 (90.9%) on the training data set and 0.9082 (90.8%) on the testing data set. In both cases the training accuracy was only slightly higher than the testing accuracy, and both models performed well. Since the results were essentially identical for the 26-feature and 18-feature models, the further reduced feature set did not hurt Logistic Regression accuracy, and either set of features would work well for the regression model.


MCMC and Plots

For this data set, MCMC is used to find the posterior distribution of the model parameters. Trace plots are used to visualize whether or not there is evidence of divergent chains, and the density plots are used to visualize the posterior distributions for the parameters.

Analysis

When running the MCMC, only 11 of the features were used, due to the high computing power MCMC requires. The model was sampled with NUTS using 4 chains, and no divergences were reported. The trace plots of the MCMC sample confirm this: the sampler converges, and there is no evidence of divergent chains.

When looking at the density plots, each feature examined has a bell-shaped posterior distribution. The feature with the most negative posterior mean is ‘count_floors_pre_eq’, with a value of about -0.310. The feature with the smallest positive posterior mean is ‘legal_ownership_status_w’, with a value of about 0.027. The feature with the largest positive posterior mean is ‘foundation_type_r’, with a value of about 0.252. This aligns with the observations from the exploratory data analysis, which suggest that ‘foundation_type_r’ may be one of the most important features for predicting the damage grade of buildings from the earthquake.


Conclusion

LDA, QDA, and Logistic Regression models were run using the Nepal earthquake building damage data, to determine which model best predicts the damage grade of each building. The analysis of both LDA and QDA shows strong evidence against these models being used for building damage grade predictions. LDA performed very poorly, with its best prediction accuracy rate being 43.9% on the training data, and 11.1% on the testing data. The QDA model performed even more poorly, with its best prediction accuracy rate being 25.1% on the training data, and 6.5% on the testing data. Therefore, LDA and QDA should not be used in predicting building damage levels for this earthquake dataset.

Logistic Regression by far performed the best of the three models. The highest performing model, which contained 18 features, had a prediction accuracy of 90.9% on the training data, and 90.8% on the testing data. Based on this outcome, Logistic Regression would be a good choice for making predictions on building damage grades.

Model                 Training Accuracy   Testing Accuracy
LDA                   43.9%               11.1%
QDA                   25.1%                6.5%
Logistic Regression   90.9%               90.8%
BMA                   90.9%               90.8%

Due to time constraints, no further models have been examined at this time. However, running similar model analysis and predictive checks on other types of models is highly recommended. Since Logistic Regression was so successful, Multinomial Regression would also likely be successful, and thus worth exploration. More complex models may also be worth exploring, since the data is extensive and multi-faceted.

The goal of this project was to use Bayesian Data Analysis to find the best model for predicting damage grade of buildings affected by the 2015 Nepal earthquake. Ideally, this work could be applied to other similar use cases, and shed light on the patterns of damage caused by earthquakes, and how to prevent such destruction in impoverished communities.


References

Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E.-J. (2020). A conceptual introduction to Bayesian model averaging. Advances in Methods and Practices in Psychological Science, 3(2), 200-215.

Srivastava, S., Gupta, M. R., & Frigyik, B. A. (2007). Bayesian quadratic discriminant analysis. Journal of Machine Learning Research, 8, 1277-1305.

Honor Code: I pledge that this work is my own.